In the realm of large language models (LLMs), the quest for improved mock data generation has gained traction. Mock data, or synthetic data, is a crucial tool in software development and testing: it lets developers simulate real-world scenarios without relying on actual data. Despite its utility, the methods for generating mock data have seen little innovation over the years, prompting a call for a revolution in this area.

The concept of "high-fidelity" mock data is central to this discussion. High-fidelity data is synthetic data that closely mimics real data and is tailored to the specific schema of a database. The goal is a seamless, one-click solution that generates realistic data without requiring extensive user input. This is particularly important for platforms like Neurelo, which aim to give users a production-like experience even when they start with empty databases.

Neurelo's approach to mock data generation is built on five key requirements: coverage of all supported data sources (MongoDB, MySQL, and Postgres), the ability to generate realistic data based solely on the schema, cost-effectiveness, fast response times, and a native Rust implementation. Rust's performance and safety characteristics make it well suited to the task.

The initial exploration used an LLM to produce Rust code that emitted raw SQL INSERT queries. This approach ran into problems: the generated code often failed to compile, and the data it produced tended to default to generic placeholders. Recognizing these limitations, the team pivoted to generating Python instead, pairing the LLM with the "faker" library to improve data quality (a sketch of this idea appears below).

A significant challenge in mock data generation is maintaining referential integrity, especially across foreign key relationships that span multiple tables. The order of insertion is critical: if one table references another, the referenced rows must be inserted first. To handle this, the team applied topological sorting, ordering insertions according to the dependencies in the database schema. Cyclic relationships complicate that ordering further; to manage them, the team proposed breaking cycles by temporarily inserting NULL values during generation, so rows can be inserted without violating referential integrity constraints (a dependency-ordering sketch appears below).

As the project progressed, the team encountered issues with unique constraints, particularly when generating large datasets. Random generation could produce duplicate values, violating unique constraints and causing cascading failures in related tables. To mitigate this, they explored strategies for guaranteeing uniqueness, including pre-generated pools of distinct values (also sketched below).

Despite these challenges, the team built a mock data generator that meets the initial requirements. They also recognized a risk of overfitting, in that the quality of the generated data depended heavily on the column-classification pipeline. To improve accuracy, they fed table names into the classification step and developed a "Genesis Point Strategy" to generate unique data efficiently. The future of mock data generation at Neurelo looks promising, with plans to tackle more complex challenges such as composite types and multi-schema environments.
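To make the Python-plus-faker idea concrete, here is a minimal sketch, not Neurelo's actual pipeline: it assumes a classification step has already mapped each column to a semantic kind (the column names and kinds below are hypothetical) and uses faker providers to emit parameterized INSERT statements.

```python
# Minimal sketch: map classified column kinds to faker providers and
# produce parameterized INSERT statements for a single table.
from faker import Faker

fake = Faker()

# Hypothetical output of a column-classification step: column name -> kind.
CLASSIFIED_COLUMNS = {
    "full_name": "person_name",
    "email": "email",
    "created_at": "timestamp",
    "city": "city",
}

# Each kind maps to a faker provider that returns a realistic value.
PROVIDERS = {
    "person_name": fake.name,
    "email": fake.email,
    "timestamp": lambda: fake.date_time_this_decade().isoformat(),
    "city": fake.city,
}

def generate_rows(table: str, columns: dict, count: int):
    """Yield (sql, params) pairs for `count` mock rows of `table`."""
    col_names = list(columns)
    placeholders = ", ".join(["%s"] * len(col_names))
    sql = f"INSERT INTO {table} ({', '.join(col_names)}) VALUES ({placeholders})"
    for _ in range(count):
        params = tuple(PROVIDERS[kind]() for kind in columns.values())
        yield sql, params

for sql, params in generate_rows("users", CLASSIFIED_COLUMNS, 3):
    print(sql, params)
```

Because the values come from semantic providers rather than generic placeholders, the output looks closer to production data while still being entirely synthetic.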
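The dependency-ordering step can be illustrated with Kahn's algorithm over a hypothetical table-to-referenced-tables map extracted from the schema. This is an illustrative sketch rather than Neurelo's implementation; the cycle case is only flagged where the real generator would fall back to inserting NULL into a nullable foreign key column and patching the reference with an UPDATE afterwards.

```python
# Minimal sketch of dependency-ordered insertion via topological sorting.
from collections import defaultdict, deque

# Hypothetical schema: each table maps to the set of tables it references.
FOREIGN_KEYS = {
    "orders": {"users", "products"},
    "products": {"vendors"},
    "users": set(),
    "vendors": set(),
}

def insertion_order(fks: dict) -> list:
    """Kahn's algorithm: referenced tables come before referencing tables."""
    indegree = {table: len(refs) for table, refs in fks.items()}
    dependents = defaultdict(set)
    for table, refs in fks.items():
        for ref in refs:
            dependents[ref].add(table)

    queue = deque(t for t, d in indegree.items() if d == 0)
    order = []
    while queue:
        table = queue.popleft()
        order.append(table)
        for dep in dependents[table]:
            indegree[dep] -= 1
            if indegree[dep] == 0:
                queue.append(dep)

    if len(order) != len(fks):
        # A cycle remains: break it by choosing a nullable FK column,
        # inserting NULL for it now, and fixing the reference with an
        # UPDATE once both sides of the cycle exist.
        raise ValueError("cycle detected; break it via a nullable FK column")
    return order

print(insertion_order(FOREIGN_KEYS))  # ['users', 'vendors', 'products', 'orders']
```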
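Finally, the pre-generated distinct pool idea might look roughly like the sketch below. It assumes faker's `unique` proxy (a plain set-based deduplication pool would work just as well), and the column names are invented for illustration.

```python
# Minimal sketch: pre-generate pools of distinct values for columns under
# a UNIQUE constraint, so random generation cannot collide mid-run.
from faker import Faker

fake = Faker()

def unique_pool(provider_name: str, size: int) -> list:
    """Pre-generate `size` distinct values from a faker provider."""
    provider = getattr(fake.unique, provider_name)
    return [provider() for _ in range(size)]

emails = unique_pool("email", 1_000)       # guaranteed distinct
usernames = unique_pool("user_name", 1_000)

# Each generated row pops from a pool instead of calling the provider
# directly, so UNIQUE constraints cannot be violated downstream.
row = {"email": emails.pop(), "username": usernames.pop()}
print(row)
```

Drawing from a fixed pool trades a small amount of up-front generation time for the guarantee that no duplicate ever reaches the database, which avoids the cascading failures that a single constraint violation can trigger in dependent tables.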
The ongoing evolution of this technology aims to provide developers with high-fidelity mock data generation that is both efficient and effective, paving the way for a new standard in software testing and development.